Analyst: Zac Lewis
This housing dataset was downloaded from Kaggle on 2022/08/25 (Y/m/d)
from https://www.kaggle.com/datasets/arathipraj/house-data
The dataset consists of Price of Houses in King County , Washington from sales between May 2014 and May 2015. Along with house price it consists of information on 18 house features, date of sale and ID of sale.
On first glance, some of these variables will undoubtedly be describing the same things (e.g. sqft_living and sqft_above, sqft_below) and as such, will probably need to be removed to simplify the analysis. It also appears that 'grade' will be able to be easily converted to a ordinal variable with 3 values ('low','avg','high').
In the following cell, I import some basic packages that I find essential and generally helpful in the process.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Now, I will load the dataset into a pandas DataFrame object.
df = pd.read_csv('house_data.csv')
df.head()
df.tail()
| id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21608 | 263000018 | 20140521T000000 | 360000 | 3 | 2.50 | 1530 | 1131 | 3.0 | 0 | 0 | ... | 8 | 1530 | 0 | 2009 | 0 | 98103 | 47.6993 | -122.346 | 1530 | 1509 |
| 21609 | 6600060120 | 20150223T000000 | 400000 | 4 | 2.50 | 2310 | 5813 | 2.0 | 0 | 0 | ... | 8 | 2310 | 0 | 2014 | 0 | 98146 | 47.5107 | -122.362 | 1830 | 7200 |
| 21610 | 1523300141 | 20140623T000000 | 402101 | 2 | 0.75 | 1020 | 1350 | 2.0 | 0 | 0 | ... | 7 | 1020 | 0 | 2009 | 0 | 98144 | 47.5944 | -122.299 | 1020 | 2007 |
| 21611 | 291310100 | 20150116T000000 | 400000 | 3 | 2.50 | 1600 | 2388 | 2.0 | 0 | 0 | ... | 8 | 1600 | 0 | 2004 | 0 | 98027 | 47.5345 | -122.069 | 1410 | 1287 |
| 21612 | 1523300157 | 20141015T000000 | 325000 | 2 | 0.75 | 1020 | 1076 | 2.0 | 0 | 0 | ... | 7 | 1020 | 0 | 2008 | 0 | 98144 | 47.5941 | -122.299 | 1020 | 1357 |
5 rows × 21 columns
This cell will use pandas methods to check the DataFrame for missing or null values.
miss_check = df.isna().values.any()
null_check = df.isnull().values.any()
print(f'There are missing data: {miss_check}')
print(f'There are null data: {null_check}')
There are missing data: False There are null data: False
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 21613 entries, 0 to 21612 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 21613 non-null int64 1 date 21613 non-null object 2 price 21613 non-null int64 3 bedrooms 21613 non-null int64 4 bathrooms 21613 non-null float64 5 sqft_living 21613 non-null int64 6 sqft_lot 21613 non-null int64 7 floors 21613 non-null float64 8 waterfront 21613 non-null int64 9 view 21613 non-null int64 10 condition 21613 non-null int64 11 grade 21613 non-null int64 12 sqft_above 21613 non-null int64 13 sqft_basement 21613 non-null int64 14 yr_built 21613 non-null int64 15 yr_renovated 21613 non-null int64 16 zipcode 21613 non-null int64 17 lat 21613 non-null float64 18 long 21613 non-null float64 19 sqft_living15 21613 non-null int64 20 sqft_lot15 21613 non-null int64 dtypes: float64(4), int64(16), object(1) memory usage: 3.5+ MB
df2.view.astype("category")
df.describe()
| id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.161300e+04 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 |
| mean | 4.580302e+09 | 5.400881e+05 | 3.370842 | 2.114757 | 2079.899736 | 1.510697e+04 | 1.494309 | 0.007542 | 0.234303 | 3.409430 | 7.656873 | 1788.390691 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
| std | 2.876566e+09 | 3.671272e+05 | 0.930062 | 0.770163 | 918.440897 | 4.142051e+04 | 0.539989 | 0.086517 | 0.766318 | 0.650743 | 1.175459 | 828.090978 | 442.575043 | 29.373411 | 401.679240 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
| min | 1.000102e+06 | 7.500000e+04 | 0.000000 | 0.000000 | 290.000000 | 5.200000e+02 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 290.000000 | 0.000000 | 1900.000000 | 0.000000 | 98001.000000 | 47.155900 | -122.519000 | 399.000000 | 651.000000 |
| 25% | 2.123049e+09 | 3.219500e+05 | 3.000000 | 1.750000 | 1427.000000 | 5.040000e+03 | 1.000000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1190.000000 | 0.000000 | 1951.000000 | 0.000000 | 98033.000000 | 47.471000 | -122.328000 | 1490.000000 | 5100.000000 |
| 50% | 3.904930e+09 | 4.500000e+05 | 3.000000 | 2.250000 | 1910.000000 | 7.618000e+03 | 1.500000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1560.000000 | 0.000000 | 1975.000000 | 0.000000 | 98065.000000 | 47.571800 | -122.230000 | 1840.000000 | 7620.000000 |
| 75% | 7.308900e+09 | 6.450000e+05 | 4.000000 | 2.500000 | 2550.000000 | 1.068800e+04 | 2.000000 | 0.000000 | 0.000000 | 4.000000 | 8.000000 | 2210.000000 | 560.000000 | 1997.000000 | 0.000000 | 98118.000000 | 47.678000 | -122.125000 | 2360.000000 | 10083.000000 |
| max | 9.900000e+09 | 7.700000e+06 | 33.000000 | 8.000000 | 13540.000000 | 1.651359e+06 | 3.500000 | 1.000000 | 4.000000 | 5.000000 | 13.000000 | 9410.000000 | 4820.000000 | 2015.000000 | 2015.000000 | 98199.000000 | 47.777600 | -121.315000 | 6210.000000 | 871200.000000 |
df.nunique()
id 21436 date 372 price 4028 bedrooms 13 bathrooms 30 sqft_living 1038 sqft_lot 9782 floors 6 waterfront 2 view 5 condition 5 grade 12 sqft_above 946 sqft_basement 306 yr_built 116 yr_renovated 70 zipcode 70 lat 5034 long 752 sqft_living15 777 sqft_lot15 8689 dtype: int64
df.duplicated().sum()
0
I will use seaborn for visualizations.
import seaborn as sns
sns.set()
sns.set_style('whitegrid')
sns.distplot(df['price'])
c:\Users\Zac\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='price', ylabel='Density'>
sns.pairplot(data=df)
<seaborn.axisgrid.PairGrid at 0x2af7db8ebc0>
sns.heatmap(df.corr())
<AxesSubplot:>
df2 = df.copy()
df2 = df2.drop(['id','lat','long', 'sqft_living'], axis=1)
df2.corr().sort_values('price', ascending=False)
| price | bedrooms | bathrooms | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | sqft_living15 | sqft_lot15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| price | 1.000000 | 0.308350 | 0.525138 | 0.089661 | 0.256794 | 0.266369 | 0.397293 | 0.036362 | 0.667434 | 0.605567 | 0.323816 | 0.054012 | 0.126434 | -0.053203 | 0.585379 | 0.082447 |
| grade | 0.667434 | 0.356967 | 0.664983 | 0.113621 | 0.458183 | 0.082775 | 0.251321 | -0.144674 | 1.000000 | 0.755923 | 0.168392 | 0.446963 | 0.014414 | -0.184862 | 0.713202 | 0.119248 |
| sqft_above | 0.605567 | 0.477600 | 0.685342 | 0.183512 | 0.523885 | 0.072075 | 0.167649 | -0.158214 | 0.755923 | 1.000000 | -0.051943 | 0.423898 | 0.023285 | -0.261190 | 0.731870 | 0.194050 |
| sqft_living15 | 0.585379 | 0.391638 | 0.568634 | 0.144608 | 0.279885 | 0.086463 | 0.280439 | -0.092824 | 0.713202 | 0.731870 | 0.200355 | 0.326229 | -0.002673 | -0.279033 | 1.000000 | 0.183192 |
| bathrooms | 0.525138 | 0.515884 | 1.000000 | 0.087740 | 0.500653 | 0.063744 | 0.187737 | -0.124982 | 0.664983 | 0.685342 | 0.283770 | 0.506019 | 0.050739 | -0.203866 | 0.568634 | 0.087175 |
| view | 0.397293 | 0.079532 | 0.187737 | 0.074710 | 0.029444 | 0.401857 | 1.000000 | 0.045990 | 0.251321 | 0.167649 | 0.276947 | -0.053440 | 0.103917 | 0.084827 | 0.280439 | 0.072575 |
| sqft_basement | 0.323816 | 0.303093 | 0.283770 | 0.015286 | -0.245705 | 0.080588 | 0.276947 | 0.174105 | 0.168392 | -0.051943 | 1.000000 | -0.133124 | 0.071323 | 0.074845 | 0.200355 | 0.017276 |
| bedrooms | 0.308350 | 1.000000 | 0.515884 | 0.031703 | 0.175429 | -0.006582 | 0.079532 | 0.028472 | 0.356967 | 0.477600 | 0.303093 | 0.154178 | 0.018841 | -0.152668 | 0.391638 | 0.029244 |
| waterfront | 0.266369 | -0.006582 | 0.063744 | 0.021604 | 0.023698 | 1.000000 | 0.401857 | 0.016653 | 0.082775 | 0.072075 | 0.080588 | -0.026161 | 0.092885 | 0.030285 | 0.086463 | 0.030703 |
| floors | 0.256794 | 0.175429 | 0.500653 | -0.005201 | 1.000000 | 0.023698 | 0.029444 | -0.263768 | 0.458183 | 0.523885 | -0.245705 | 0.489319 | 0.006338 | -0.059121 | 0.279885 | -0.011269 |
| yr_renovated | 0.126434 | 0.018841 | 0.050739 | 0.007644 | 0.006338 | 0.092885 | 0.103917 | -0.060618 | 0.014414 | 0.023285 | 0.071323 | -0.224874 | 1.000000 | 0.064357 | -0.002673 | 0.007854 |
| sqft_lot | 0.089661 | 0.031703 | 0.087740 | 1.000000 | -0.005201 | 0.021604 | 0.074710 | -0.008958 | 0.113621 | 0.183512 | 0.015286 | 0.053080 | 0.007644 | -0.129574 | 0.144608 | 0.718557 |
| sqft_lot15 | 0.082447 | 0.029244 | 0.087175 | 0.718557 | -0.011269 | 0.030703 | 0.072575 | -0.003406 | 0.119248 | 0.194050 | 0.017276 | 0.070958 | 0.007854 | -0.147221 | 0.183192 | 1.000000 |
| yr_built | 0.054012 | 0.154178 | 0.506019 | 0.053080 | 0.489319 | -0.026161 | -0.053440 | -0.361417 | 0.446963 | 0.423898 | -0.133124 | 1.000000 | -0.224874 | -0.346869 | 0.326229 | 0.070958 |
| condition | 0.036362 | 0.028472 | -0.124982 | -0.008958 | -0.263768 | 0.016653 | 0.045990 | 1.000000 | -0.144674 | -0.158214 | 0.174105 | -0.361417 | -0.060618 | 0.003026 | -0.092824 | -0.003406 |
| zipcode | -0.053203 | -0.152668 | -0.203866 | -0.129574 | -0.059121 | 0.030285 | 0.084827 | 0.003026 | -0.184862 | -0.261190 | 0.074845 | -0.346869 | 0.064357 | 1.000000 | -0.279033 | -0.147221 |
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit(df2['view'].array.reshape(-1,1))
view_col_names = enc.get_feature_names_out()
enc.transform([[0],[1],[2],[3],[4]]).toarray()
array([[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1.]])
sns.boxplot(x=df2.condition,y=df2.grade, hue=df2.waterfront)
<AxesSubplot:xlabel='condition', ylabel='grade'>